Dynamic outlier bias reduction system and method

ABSTRACT

A system and method are described herein for data filtering to reduce functional and trend line outlier bias. Outliers are removed from the data set through an objective statistical method. Bias is determined based on absolute error, relative error, or both. Error values are computed from the data, model coefficients, or trend line calculations. Outlier data records are removed when the error values are greater than or equal to the user-supplied criteria. For optimization methods or other iterative calculations, the removed data are re-applied to the model in each iteration to compute new results. Using model values for the complete dataset, new error values are computed and the outlier bias reduction procedure is re-applied. Overall error is minimized for model coefficients and outlier removed data in an iterative fashion until user defined error improvement limits are reached. The filtered data may be used for validation, outlier bias reduction and data quality operations.

CROSS-REFERENCE TO RELATED APPLICATIONS

The present application is a continuation of U.S. patent application Ser. No. 13/772,212, filed Feb. 20, 2013 by Richard Bradley Jones and entitled “Dynamic Outlier Bias Reduction System and Method,” which is a continuation-in-part patent application that claims the benefit of and priority to U.S. Non-Provisional patent application Ser. No. 13/213,780, filed Aug. 19, 2011 by Richard Bradley Jones and entitled “Dynamic Outlier Bias Reduction System and Method,” all of which are incorporated herein by reference in their entirety.

STATEMENTS REGARDING FEDERALLY SPONSORED RESEARCH OR DEVELOPMENT

Not applicable.

REFERENCE TO A MICROFICHE APPENDIX

Not applicable.

FIELD OF THE INVENTION

The present invention relates to the analysis of data where outlier elements are removed (or filtered) from the analysis development. The analysis may be related to the computation of simple statistics or more complex operations involving mathematical models that use data in their development. The purpose of outlier data filtering may be to perform data quality and data validation operations, or to compute representative standards, statistics, data groups that have applications in subsequent analyses, regression analysis, time series analysis or qualified data for mathematical models development.

BACKGROUND

Removing outlier data in standards or data-driven model development is an important part of the pre-analysis work to ensure a representative and fair analysis is developed from the underlying data. For example, developing equitable benchmarking of greenhouse gas standards for carbon dioxide (CO₂), ozone (O₃), water vapor (H₂O), hydrofluorocarbons (HFCs), perfluorocarbons (PFCs), chlorofluorocarbons (CFCs), sulfur hexafluoride (SF₆), methane (CH₄), nitrous oxide (N₂O), carbon monoxide (CO), nitrogen oxides (NOx), and non-methane volatile organic compounds (NMVOCs) emissions requires that collected industrial data used in the standards development exhibit certain properties. Extremely good or bad performance by a few of the industrial sites should not bias the standards computed for other sites. It may be judged unfair or unrepresentative to include such performance results in the standard calculations. In the past, the performance outliers were removed via a semi-quantitative process requiring subjective input. The present system and method is a data-driven approach that performs this task as an integral part of the model development, and not at the pre-analysis or pre-model development stage.

The removal of bias can be a subjective process wherein justification is documented in some form to substantiate data changes. However, any form of outlier removal is a form of data censoring that carries the potential for changing calculation results. Such data filtering may or may not reduce bias or error in the calculation and, in the spirit of full analysis disclosure, strict data removal guidelines and documentation to remove outliers need to be included with the analysis results. Therefore, there is a need in the art to provide a new system and method for objectively removing outlier data bias using a dynamic statistical process useful for the purposes of data quality operations, data validation, statistic calculations or mathematical model development, etc. The outlier bias removal system and method can also be used to group data into representative categories where the data is applied to the development of mathematical models customized to each group. In a preferred embodiment, coefficients are defined as multiplicative and additive factors in mathematical models and also other numerical parameters that are nonlinear in nature. For example, in the mathematical model, f(x,y,z)=a*x+b*y^(c)+d*sin(ez)+f, a, b, c, d, e, and f are all defined as coefficients. The values of these terms may be fixed or part of the development of the mathematical model.
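The example model above can be written out as a short function to make the role of each coefficient concrete. This is only an illustrative sketch: the function name `model` and the numeric coefficient values are assumptions chosen for the example and do not come from the specification.

```python
import math

def model(x, y, z, a, b, c, d, e, f):
    """Evaluate f(x, y, z) = a*x + b*y**c + d*sin(e*z) + f.

    a, b, d and f are multiplicative or additive factors; c and e enter
    nonlinearly. All six are treated as coefficients.
    """
    return a * x + b * y ** c + d * math.sin(e * z) + f

# Hypothetical coefficient values, chosen only for illustration.
coeffs = dict(a=2.0, b=0.5, c=2.0, d=1.0, e=0.0, f=3.0)
value = model(1.0, 4.0, 10.0, **coeffs)
print(value)
```

Any of these coefficients could be held fixed or left free for an optimizer to estimate during model development.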

BRIEF SUMMARY

A preferred embodiment includes a computer implemented method for reducing outlier bias comprising the steps of: selecting a bias criteria; providing a data set; providing a set of model coefficients; selecting a set of target values; (1) generating a set of predicted values for the complete data set; (2) generating an error set for the dataset; (3) generating a set of error threshold values based on the error set and the bias criteria; (4) generating, by a processor, a censored data set based on the error set and the set of error threshold values; (5) generating, by the processor, a set of new model coefficients; and (6) using the set of new model coefficients, repeating steps (1)-(5), unless a censoring performance termination criteria is satisfied. In a preferred embodiment, the set of predicted values may be generated based on the data set and the set of model coefficients. In a preferred embodiment, the error set may comprise a set of absolute errors and a set of relative errors, generated based on the set of predicted values and the set of target values. In another embodiment, the error set may comprise values calculated as the difference between the set of predicted values and the set of target values. In another embodiment, the step of generating the set of new coefficients may further comprise the step of minimizing the set of errors between the set of predicted values and the set of actual values, which can be accomplished using a linear, or a non-linear optimization model. In a preferred embodiment, the censoring performance termination criteria may be based on a standard error and a coefficient of determination.
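Steps (1)-(6) above can be sketched as a short loop. The following is a minimal illustration under stated assumptions, not the claimed implementation: the model is an ordinary least-squares line fitted with NumPy, the bias criteria is interpreted as a percentile of each error set, a record is kept only when both its absolute and relative errors fall below their thresholds, and the names `dobr_fit`, `bias_pct`, `tol` and `max_iter` are hypothetical. Actual values are assumed to be nonzero so that the relative error is defined.

```python
import numpy as np

def dobr_fit(X, y, bias_pct=90.0, tol=1e-6, max_iter=50):
    """Iterative censoring loop with a linear least-squares model (sketch)."""
    A = np.column_stack([X, np.ones(len(y))])        # predictors plus intercept
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)     # initial coefficients, all data
    keep = np.ones(len(y), dtype=bool)
    prev = None
    for _ in range(max_iter):
        pred = A @ coef                              # (1) predict for the complete data set
        abs_err = (pred - y) ** 2                    # (2) absolute and relative error sets
        rel_err = ((pred - y) / y) ** 2
        t_abs = np.percentile(abs_err, bias_pct)     # (3) thresholds (percentile: assumption)
        t_rel = np.percentile(rel_err, bias_pct)
        new_keep = (abs_err < t_abs) & (rel_err < t_rel)  # (4) drop errors >= threshold
        if new_keep.sum() <= A.shape[1]:             # too few records left to refit
            break
        keep = new_keep
        coef, *_ = np.linalg.lstsq(A[keep], y[keep], rcond=None)  # (5) new coefficients
        resid = y[keep] - A[keep] @ coef
        se = np.sqrt(np.sum(resid ** 2) / (keep.sum() - A.shape[1]))
        r2 = 1.0 - np.sum(resid ** 2) / np.sum((y[keep] - y[keep].mean()) ** 2)
        # (6) stop once standard error and R^2 change by less than tol
        if prev is not None and abs(prev[0] - se) < tol and abs(prev[1] - r2) < tol:
            break
        prev = (se, r2)
    return coef, keep

# Synthetic data (illustrative only): a line with five gross outliers.
rng = np.random.default_rng(0)
x = np.linspace(1.0, 10.0, 100)
y = 3.0 * x + 5.0 + rng.normal(0.0, 0.1, size=100)
y[::20] += 50.0
coef, keep = dobr_fit(x, y)
print(coef, int(keep.sum()))
```

Note that the errors are recomputed for the complete data set on every pass, so a record censored in one iteration can re-enter the fit in a later one.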

Another embodiment includes a computer implemented method for reducing outlier bias comprising the steps of: selecting an error criteria; selecting a data set; selecting a set of actual values; selecting an initial set of model coefficients; generating a set of model predicted values based on the complete data set and the initial set of model coefficients; (1) generating a set of errors based on the model predicted values and the set of actual values for the complete dataset; (2) generating a set of error threshold values based on the complete set of errors and the error criteria for the complete data set; (3) generating an outlier removed data set, wherein the filtering is based on the complete data set and the set of error threshold values; (4) generating a set of new coefficients based on the filtered data set and the set of previous coefficients, wherein the generation of the set of new coefficients is performed by the computer processor; (5) generating a set of outlier bias reduced model predicted values based on the filtered data set and the set of new model coefficients, wherein the generation of the set of outlier bias reduced model predicted values is performed by a computer processor; (6) generating a set of model performance values based on the model predicted values and the set of actual values; repeating steps (1)-(6), while substituting the set of new coefficients for the set of coefficients from the previous iteration, unless: a performance termination criteria is satisfied; and storing the set of model predicted values in a computer data medium.

Another embodiment includes a computer implemented method for reducing outlier bias comprising the steps of: selecting a target variable for a facility; selecting a set of actual values of the target variable; identifying a plurality of variables for the facility that are related to the target variable; obtaining a data set for the facility, the data set comprising values for the plurality of variables; selecting a bias criteria; selecting a set of model coefficients; (1) generating a set of predicted values based on the complete data set and the set of model coefficients; (2) generating a set of censoring model performance values based on the set of predicted values and the set of actual values; (3) generating an error set based on the set of predicted values and the set of actual values for the target variable; (4) generating a set of error threshold values based on the error set and the bias criteria; (5) generating, by a processor, a censored data set based on the data set and the set of error thresholds; (6) generating, by the processor, a set of new model coefficients based on the censored data set and the set of model coefficients; (7) generating, by the processor, a set of new predicted values based on the data set and the set of new model coefficients; (8) generating a set of new censoring model performance values based on the set of new predicted values and the set of actual values; using the set of new coefficients, repeating steps (1)-(8) unless a censoring performance termination criteria is satisfied; and storing the set of new model predicted values in a computer data medium.

Another embodiment includes a computer implemented method for reducing outlier bias comprising the steps of: determining a target variable for a facility, wherein the target variable is a metric for an industrial facility related to its production, financial performance, or emissions; identifying a plurality of variables for the facility, wherein the plurality of variables comprises: a plurality of direct variables for the facility that influence the target variable; and a set of transformed variables for the facility, each transformed variable being a function of at least one direct facility variable that influences the target variable; selecting an error criteria comprising: an absolute error, and a relative error; obtaining a data set for the facility, wherein the data set comprises values for the plurality of variables; selecting a set of actual values of the target variable; selecting an initial set of model coefficients; generating a set of model predicted values based on the complete data set and the initial set of model coefficients; generating a complete set of errors based on the set of model predicted values and the set of actual values, wherein the relative error is calculated using the formula: Relative Error_(m)=((Predicted Value_(m)−Actual Value_(m))/Actual Value_(m))², wherein ‘m’ is a reference number, and wherein the absolute error is calculated using the formula: Absolute Error_(m)=(Predicted Value_(m)−Actual Value_(m))²; generating a set of model performance values based on the set of model predicted values and the set of actual values, wherein the set of overall model performance values comprises: a first standard error, and a first coefficient of determination; (1) generating a set of errors based on the model predicted values and the set of actual values for the complete dataset; (2) generating a set of error threshold values based on the complete set of errors and the error criteria for the complete data set; (3) generating an outlier removed data set by removing data with error values greater than or equal to the error threshold values, wherein the filtering is based on the complete data set and the set of error threshold values; (4) generating a set of outlier bias reduced model predicted values based on the outlier removed data set and the set of model coefficients by minimizing the error between the set of predicted values and the set of actual values using at least one of: a linear optimization model, and a nonlinear optimization model, wherein the generation of the new model predicted values is performed by a computer processor; (5) generating a set of new coefficients based on the outlier removed data set and the previous set of coefficients, wherein the generation of the set of new coefficients is performed by the computer processor; (6) generating a set of overall model performance values based on the set of new predicted model values and the set of actual values, wherein the set of model performance values comprise: a second standard error, and a second coefficient of determination; repeating steps (1)-(6), while substituting the set of new coefficients for the set of coefficients from the previous iteration, unless: a performance termination criteria is satisfied, wherein the performance termination criteria comprises: a standard error termination value and a coefficient of determination termination value, and wherein satisfying the performance termination criteria comprises: the standard error termination value is greater than the difference between the first and second standard error, and the coefficient of determination termination value is greater than the difference between the first and second coefficient of determination; and storing the set of new model predicted values in a computer data medium.
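The error formulas and the termination test recited above can be expressed as small helper functions. This is a hedged sketch: the function names are hypothetical, the standard error is computed with a degrees-of-freedom correction (an assumption, since the text does not define it), and the "difference" between successive performance values is taken as an absolute difference (also an assumption).

```python
import numpy as np

def relative_error(pred, actual):
    """Relative Error_m = ((Predicted_m - Actual_m) / Actual_m)**2."""
    return ((pred - actual) / actual) ** 2

def absolute_error(pred, actual):
    """Absolute Error_m = (Predicted_m - Actual_m)**2."""
    return (pred - actual) ** 2

def standard_error(pred, actual, n_coef):
    """Residual standard error with n - n_coef degrees of freedom (assumed form)."""
    dof = max(len(actual) - n_coef, 1)
    return float(np.sqrt(np.sum((actual - pred) ** 2) / dof))

def r_squared(pred, actual):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    ss_res = np.sum((actual - pred) ** 2)
    ss_tot = np.sum((actual - np.mean(actual)) ** 2)
    return float(1.0 - ss_res / ss_tot)

def terminated(se_first, se_second, r2_first, r2_second, se_tol, r2_tol):
    """True when both termination values exceed the change between iterations."""
    return se_tol > abs(se_first - se_second) and r2_tol > abs(r2_first - r2_second)

# Illustrative values only.
actual = np.array([10.0, 20.0, 40.0])
pred = np.array([11.0, 18.0, 40.0])
rel = relative_error(pred, actual)
ab = absolute_error(pred, actual)
stop = terminated(0.50, 0.499, 0.90, 0.9005, se_tol=0.01, r2_tol=0.001)
print(rel, ab, stop)
```

Because both errors are squared, over- and under-prediction of the same size are treated alike when thresholds are applied.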

Another embodiment includes a computer implemented method for reducing outlier bias comprising the steps of: selecting an error criteria; selecting a data set; selecting a set of actual values; selecting an initial set of model predicted values; determining a set of errors based on the set of model predicted values and the set of actual values; (1) determining a set of error threshold values based on the complete set of errors and the error criteria; (2) generating an outlier removed data set, wherein the filtering is based on the data set and the set of error threshold values; (3) generating a set of outlier bias reduced model predicted values based on the outlier removed data set and the previous model predicted values, wherein the generation of the set of outlier bias reduced model predicted values is performed by a computer processor; (4) determining a set of errors based on the set of new model predicted values and the set of actual values; repeating steps (1)-(4), while substituting the set of new model predicted values for the set of model predicted values from the previous iteration, unless: a performance termination criteria is satisfied; and storing the set of outlier bias reduced model predicted values in a computer data medium.

Another embodiment includes a computer implemented method for reducing outlier bias comprising the steps of: determining a target variable for a facility; identifying a plurality of variables for the facility, wherein the plurality of variables comprises: a plurality of direct variables for the facility that influence the target variable; and a set of transformed variables for the facility, each transformed variable being a function of at least one direct facility variable that influences the target variable; selecting an error criteria comprising: an absolute error, and a relative error; obtaining a data set, wherein the data set comprises values for the plurality of variables, and selecting a set of actual values of the target variable; selecting an initial set of model coefficients; generating a set of model predicted values by applying a set of model coefficients to the data set; determining a set of performance values based on the set of model predicted values and the set of actual values, wherein the set of performance values comprises: a first standard error, and a first coefficient of determination; (1) generating a set of errors based on the set of model predicted values and the set of actual values for the complete dataset, wherein the relative error is calculated using the formula: Relative Error_(m)=((Predicted Value_(m)−Actual Value_(m))/Actual Value_(m))², wherein ‘m’ is a reference number, and wherein the absolute error is calculated using the formula: Absolute Error_(m)=(Predicted Value_(m)−Actual Value_(m))²; (2) generating a set of error threshold values based on the complete set of errors and the error criteria for the complete data set; (3) generating an outlier removed data set by removing data with error values greater than or equal to the set of error threshold values, wherein the filtering is based on the data set and the set of error threshold values; (4) generating a set of new coefficients based on the outlier removed data set and the set of previous coefficients; (5) generating a set of outlier bias reduced model predicted values based on the outlier removed data set and the set of new model coefficients by minimizing the error between the set of predicted values and the set of actual values using at least one of: a linear optimization model, and a nonlinear optimization model, wherein the generation of the model predicted values is performed by a computer processor; (6) generating a set of updated performance values based on the set of outlier bias reduced model predicted values and the set of actual values, wherein the set of updated performance values comprises: a second standard error, and a second coefficient of determination; repeating steps (1)-(6), while substituting the set of new coefficients for the set of coefficients from the previous iteration, unless: a performance termination criteria is satisfied, wherein the performance termination criteria comprises: a standard error termination value, and a coefficient of determination termination value, and wherein satisfying the performance termination criteria comprises the standard error termination value is greater than the difference between the first and second standard error, and the coefficient of determination termination value is greater than the difference between the first and second coefficient of determination; and storing the set of outlier bias reduction factors in a computer data medium.

Another embodiment includes a computer implemented method for assessing the viability of a data set as used in developing a model comprising the steps of: providing a target data set comprising a plurality of data values; generating a random target data set based on the target data set; selecting a set of bias criteria values; generating, by a processor, an outlier bias reduced target data set based on the data set and each of the selected bias criteria values; generating, by the processor, an outlier bias reduced random data set based on the random data set and each of the selected bias criteria values; calculating a set of error values for the outlier bias reduced data set and the outlier bias reduced random data set; calculating a set of correlation coefficients for the outlier bias reduced data set and the outlier bias reduced random data set; generating bias criteria curves for the data set and the random data set based on the selected bias criteria values and the corresponding error value and correlation coefficient; and comparing the bias criteria curve for the data set to the bias criteria curve for the random data set. The outlier bias reduced target data set and the outlier bias reduced random target data set are generated using the Dynamic Outlier Bias Removal methodology. The random target data set can comprise randomized data values developed from values within the range of the plurality of data values. Also, the set of error values can comprise a set of standard errors, and the set of correlation coefficients can comprise a set of coefficient of determination values. Another embodiment can further comprise the step of generating automated advice regarding the viability of the target data set to support the developed model, and vice versa, based on comparing the bias criteria curve for the target data set to the bias criteria curve for the random target data set. Advice can be generated based on parameters selected by analysts, such as a correlation coefficient threshold and/or an error threshold. Yet another embodiment further comprises the steps of: providing an actual data set comprising a plurality of actual data values corresponding to the model predicted values; generating a random actual data set based on the actual data set; generating, by a processor, an outlier bias reduced actual data set based on the actual data set and each of the selected bias criteria values; generating, by the processor, an outlier bias reduced random actual data set based on the random actual data set and each of the selected bias criteria values; generating, for each selected bias criteria, a random data plot based on the outlier bias reduced random target data set and the outlier bias reduced random actual data; generating, for each selected bias criteria, a realistic data plot based on the outlier bias reduced target data set and the outlier bias reduced actual target data set; and comparing the random data plot with the realistic data plot corresponding to each of the selected bias criteria.
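The comparison of bias criteria curves for a target data set and a random data set can be illustrated roughly as follows. This sketch makes several assumptions that are not in the text: the random data set is drawn uniformly from the range of the target values, the outlier bias reduction step is replaced by a simplified stand-in (`censored_fit`, a straight-line fit that repeatedly drops records whose squared error is at or above a percentile), and the bias criteria values are percentiles. All function and variable names are hypothetical.

```python
import numpy as np

def censored_fit(x, y, pct, n_iter=20):
    """Simplified stand-in for outlier bias reduction; returns (std error, R^2)."""
    keep = np.ones(len(y), dtype=bool)
    if pct < 100:                                    # pct >= 100 means no censoring
        for _ in range(n_iter):
            slope, intercept = np.polyfit(x[keep], y[keep], 1)
            err = (slope * x + intercept - y) ** 2   # errors on the complete data set
            new_keep = err < np.percentile(err, pct)
            if np.array_equal(new_keep, keep):       # censored set has stabilized
                break
            keep = new_keep
    slope, intercept = np.polyfit(x[keep], y[keep], 1)
    resid = y[keep] - (slope * x[keep] + intercept)
    se = float(np.sqrt(np.sum(resid ** 2) / (keep.sum() - 2)))
    r2 = float(1.0 - np.sum(resid ** 2) / np.sum((y[keep] - y[keep].mean()) ** 2))
    return se, r2

def bias_criteria_curve(x, y, pcts):
    """One (criterion, standard error, R^2) point per bias criteria value."""
    return [(p, *censored_fit(x, y, p)) for p in pcts]

# Synthetic target data and a random counterpart within the same range.
rng = np.random.default_rng(1)
x = np.linspace(0.0, 10.0, 200)
target = 2.0 * x + 1.0 + rng.normal(0.0, 1.0, size=200)
random_target = rng.uniform(target.min(), target.max(), size=200)

pcts = [100, 95, 90, 85, 80]
real_curve = bias_criteria_curve(x, target, pcts)
random_curve = bias_criteria_curve(x, random_target, pcts)
for real, rand in zip(real_curve, random_curve):
    print(real, rand)
```

If the curve for the target data set is not clearly separated from the curve for the random data set, that would suggest the data set gives little support to the model.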

A preferred embodiment includes a system comprising: a server, comprising: a processor, and a storage subsystem; a database stored by the storage subsystem comprising: a data set; and a computer program stored by the storage subsystem comprising instructions that, when executed, cause the processor to: select a bias criteria; provide a set of model coefficients; select a set of target values; (1) generate a set of predicted values for the data set; (2) generate an error set for the dataset; (3) generate a set of error threshold values based on the error set and the bias criteria; (4) generate a censored data set based on the error set and the set of error threshold values; (5) generate a set of new model coefficients; and (6) using the set of new model coefficients, repeat steps (1)-(5), unless a censoring performance termination criteria is satisfied. In a preferred embodiment, the set of predicted values may be generated based on the data set and the set of model coefficients. In a preferred embodiment, the error set may comprise a set of absolute errors and a set of relative errors, generated based on the set of predicted values and the set of target values. In another embodiment, the error set may comprise values calculated as the difference between the set of predicted values and the set of target values. In another embodiment, the step of generating the set of new coefficients may further comprise the step of minimizing the set of errors between the set of predicted values and the set of actual values, which can be accomplished using a linear, or a non-linear optimization model. In a preferred embodiment, the censoring performance termination criteria may be based on a standard error and a coefficient of determination.

Another embodiment of the present invention includes a system comprising: a server, comprising: a processor, and a storage subsystem; a database stored by the storage subsystem comprising: a data set; and a computer program stored by the storage subsystem comprising instructions that, when executed, cause the processor to: select an error criteria; select a set of actual values; select an initial set of coefficients; generate a complete set of model predicted values from the data set and the initial set of coefficients; (1) generate a set of errors based on the model predicted values and the set of actual values for the complete dataset; (2) generate a set of error threshold values based on the complete set of errors and the error criteria for the complete data set; (3) generate an outlier removed data set, wherein the filtering is based on the complete data set and the set of error threshold values; (4) generate a set of outlier bias reduced model predicted values based on the outlier removed data set and the set of coefficients, wherein the generation of the set of outlier bias reduced model predicted values is performed by a computer processor; (5) generate a set of new coefficients based on the outlier removed data set and the set of previous coefficients, wherein the generation of the set of new coefficients is performed by the computer processor; (6) generate a set of model performance values based on the outlier bias reduced model predicted values and the set of actual values; repeat steps (1)-(6), while substituting the set of new coefficients for the set of coefficients from the previous iteration, unless: a performance termination criteria is satisfied; and store the set of overall outlier bias reduction model predicted values in a computer data medium.

Yet another embodiment includes a system comprising: a server, comprising: a processor, and a storage subsystem; a database stored by the storage subsystem comprising: a target variable for a facility; a set of actual values of the target variable; a plurality of variables for the facility that are related to the target variable; a data set for the facility, the data set comprising values for the plurality of variables; and a computer program stored by the storage subsystem comprising instructions that, when executed, cause the processor to: select a bias criteria; select a set of model coefficients; (1) generate a set of predicted values based on the data set and the set of model coefficients; (2) generate a set of censoring model performance values based on the set of predicted values and the set of actual values; (3) generate an error set based on the set of predicted values and the set of actual values for the target variable; (4) generate a set of error threshold values based on the error set and the bias criteria; (5) generate a censored data set based on the data set and the set of error thresholds; (6) generate a set of new model coefficients based on the censored data set and the set of model coefficients; (7) generate a set of new predicted values based on the data set and the set of new model coefficients; (8) generate a set of new censoring model performance values based on the set of new predicted values and the set of actual values; using the set of new coefficients, repeat steps (1)-(8) unless a censoring performance termination criteria is satisfied; and store the set of new model predicted values in the storage subsystem.

Another embodiment includes a system comprising: a server, comprising: a processor, and a storage subsystem; a database stored by the storage subsystem comprising: a data set for a facility; and a computer program stored by the storage subsystem comprising instructions that, when executed, cause the processor to: determine a target variable; identify a plurality of variables, wherein the plurality of variables comprises: a plurality of direct variables for the facility that influence the target variable; and a set of transformed variables for the facility, each transformed variable being a function of at least one direct variable that influences the target variable; select an error criteria comprising: an absolute error, and a relative error; select a set of actual values of the target variable; select an initial set of coefficients; generate a set of model predicted values based on the data set and the initial set of coefficients; determine a set of errors based on the set of model predicted values and the set of actual values, wherein the relative error is calculated using the formula: Relative Error_(m)=((Predicted Value_(m)−Actual Value_(m))/Actual Value_(m))², wherein ‘m’ is a reference number, and wherein the absolute error is calculated using the formula: Absolute Error_(m)=(Predicted Value_(m)−Actual Value_(m))²; determine a set of performance values based on the set of model predicted values and the set of actual values, wherein the set of performance values comprises: a first standard error, and a first coefficient of determination; (1) generate a set of errors based on the model predicted values and the set of actual values; (2) generate a set of error threshold values based on the complete set of errors and the error criteria for the complete data set; (3) generate an outlier removed data set by filtering data with error values outside the set of error threshold values, wherein the filtering is based on the data set and the set of error threshold values; (4) generate a set of new model predicted values based on the outlier removed data set and the set of coefficients by minimizing an error between the set of model predicted values and the set of actual values using at least one of: a linear optimization model, and a nonlinear optimization model, wherein the generation of the outlier bias reduced model predicted values is performed by a computer processor; (5) generate a set of new coefficients based on the outlier removed data set and the set of previous coefficients, wherein the generation of the set of new coefficients is performed by the computer processor; (6) generate a set of performance values based on the set of new model predicted values and the set of actual values, wherein the set of model performance values comprises: a second standard error, and a second coefficient of determination; repeat steps (1)-(6), while substituting the set of new coefficients for the set of coefficients from the previous iteration, unless: a performance termination criteria is satisfied, wherein the performance termination criteria comprises: a standard error termination value, and a coefficient of determination termination value, and wherein satisfying the performance termination criteria comprises: the standard error termination value is greater than the difference between the first and second standard error, and the coefficient of determination termination value is greater than the difference between the first and second coefficient of determination; and store the set of new model predicted values in a computer data medium.

Another embodiment of the present invention includes a system comprising: a server, comprising: a processor, and a storage subsystem; a database stored by the storage subsystem comprising: a data set; and a computer program stored by the storage subsystem comprising instructions that, when executed, cause the processor to: select an error criteria; select a data set; select a set of actual values; select an initial set of model predicted values; determine a set of errors based on the set of model predicted values and the set of actual values; (1) determine a set of error threshold values based on the complete set of errors and the error criteria; (2) generate an outlier removed data set, wherein the filtering is based on the data set and the set of error threshold values; (3) generate a set of outlier bias reduced model predicted values based on the outlier removed data set and the complete set of model predicted values, wherein the generation of the set of outlier bias reduced model predicted values is performed by a computer processor; (4) determine a set of errors based on the set of outlier bias reduction model predicted values and the corresponding set of actual values; repeat steps (1)-(4), while substituting the set of outlier bias reduction model predicted values for the set of model predicted values unless: a performance termination criteria is satisfied; and store the set of outlier bias reduction factors in a computer data medium.

Another embodiment of the present invention includes a system comprising: a server, comprising: a processor, and a storage subsystem; a database stored by the storage subsystem comprising: a data set; and a computer program stored by the storage subsystem comprising instructions that, when executed, cause the processor to: determine a target variable; identify a plurality of variables for the facility, wherein the plurality of variables comprises: a plurality of direct variables for the facility that influence the target variable; and a set of transformed variables for the facility, each transformed variable being a function of at least one primary facility variable that influences the target variable; select an error criteria comprising: an absolute error, and a relative error; obtain a data set, wherein the data set comprises values for the plurality of variables, and select a set of actual values of the target variable; select an initial set of coefficients; generate a set of model predicted values by applying the set of model coefficients to the data set; determine a set of performance values based on the set of model predicted values and the set of actual values, wherein the set of performance values comprises: a first standard error, and a first coefficient of determination; (1) determine a set of errors based on the set of model predicted values and the set of actual values, wherein the relative error is calculated using the formula: Relative Error_(k)=((Predicted Value_(k)−Actual Value_(k))/Actual Value_(k))², wherein ‘k’ is a reference number, and wherein the absolute error is calculated using the formula: Absolute Error_(k)=(Predicted Value_(k)−Actual Value_(k))²; (2) determine a set of error threshold values based on the set of errors and the error criteria for the complete data set; (3) generate an outlier removed data set by removing data with error values greater than or equal to the error threshold values, wherein the filtering is based on the data set and the set of error threshold values; (4) generate a set of new coefficients based on the outlier removed dataset and the set of previous coefficients; (5) generate a set of outlier bias reduced model values based on the outlier removed data set and the set of coefficients and minimizing an error between the set of predicted values and the set of actual values using at least one of: a linear optimization model, and a nonlinear optimization model; (6) determine a set of updated performance values based on the set of outlier bias reduced model predicted values and the set of actual values, wherein the set of updated performance values comprises: a second standard error, and a second coefficient of determination; repeat steps (1)-(6), while substituting the set of new coefficients for the set of coefficients from the previous iteration, unless: a performance termination criteria is satisfied, wherein the performance termination criteria comprises: a standard error termination value, and a coefficient of determination termination value, and wherein satisfying the performance termination criteria comprises the standard error termination value is greater than the difference between the first and second standard error, and the coefficient of determination termination value is greater than the difference between the first and second coefficient of determination; and store the set of outlier bias reduction factors in a computer data medium.

Yet another embodiment includes a system for assessing the viability of a data set as used in developing a model comprising: a server, comprising: a processor, and a storage subsystem; a database stored by the storage subsystem comprising: a target data set comprising a plurality of model predicted values; a computer program stored by the storage subsystem comprising instructions that, when executed, cause the processor to: generate a random target data set; select a set of bias criteria values; generate outlier bias reduced data sets based on the target data set and each of the selected bias criteria values; generate an outlier bias reduced random target data set based on the random target data set and each of the selected bias criteria values; calculate a set of error values for the outlier bias reduced target data set and the outlier bias reduced random target data set; calculate a set of correlation coefficients for the outlier bias reduced target data set and the outlier bias reduced random target data set; generate bias criteria curves for the target data set and the random target data set based on the corresponding error value and correlation coefficient for each selected bias criteria; and compare the bias criteria curve for the target data set to the bias criteria curve for the random target data set. The processor generates the outlier bias reduced target data set and the outlier bias reduced random target data set using the Dynamic Outlier Bias Removal methodology. The random target data set can comprise randomized data values developed from values within the range of the plurality of data values. Also, the set of error values can comprise a set of standard errors, and the set of correlation coefficients can comprise a set of coefficient of determination values. In another embodiment, the program further comprises instructions that, when executed, cause the processor to generate automated advice based on comparing the bias criteria curve for the target data set to the bias criteria curve for the random target data set. Advice can be generated based on parameters selected by analysts, such as a correlation coefficient threshold and/or an error threshold. In yet another embodiment, the system's database further comprises an actual data set comprising a plurality of actual data values corresponding to the model predicted values, and the program further comprises instructions that, when executed, cause the processor to: generate a random actual data set based on the actual data set; generate an outlier bias reduced actual data set based on the actual data set and each of the selected bias criteria values; generate an outlier bias reduced random actual data set based on the random actual data set and each of the selected bias criteria values; generate, for each selected bias criteria, a random data plot based on the outlier bias reduced random target data set and the outlier bias reduced random actual data; generate, for each selected bias criteria, a realistic data plot based on the outlier bias reduced target data set and the outlier bias reduced actual target data set; and compare the random data plot with the realistic data plot corresponding to each of the selected bias criteria.

Other embodiments include a system for reducing outlier bias in target variables measured for a facility comprising a computing unit for processing a data set, the computing unit comprising a processor and a storage subsystem, an input unit for inputting the data set to be processed, the input unit comprising a measuring device for measuring a given target variable and for providing a corresponding data set, an output unit for outputting a processed data set, a computer program stored by the storage subsystem comprising instructions that, when executed, cause the processor to execute the following steps: selecting the target variable for a facility; identifying a plurality of variables for the facility that are related to the target variable; obtaining a data set for the facility, the data set comprising values for the plurality of variables; selecting a bias criteria; selecting a set of model coefficients; (1) generate a set of predicted values for the data set; (2) generate an error set for the data set; (3) generate a set of error threshold values based on the error set and the bias criteria; (4) generate a censored data set based on the error set and the set of error threshold values; (5) generate a set of new model coefficients; and (6) using the set of new model coefficients, repeat steps (1)-(5), unless a censoring performance termination criteria is satisfied.

Still other embodiments include a system for reducing outlier bias in target variables measured for a financial instrument, such as an equity security (e.g., common stock) or a derivative contract (e.g., forwards, futures, options, and swaps), comprising a computing unit for processing a data set, the computing unit comprising a processor and a storage subsystem, an input unit for receiving the data set to be processed, the input unit comprising a storage device for storing data on a target variable (e.g., stock price) and for providing a corresponding data set, an output unit for outputting a processed data set, a computer program stored by the storage subsystem comprising instructions that, when executed, cause the processor to execute the following steps: selecting the target variable for the financial instrument; identifying a plurality of variables for the instrument that are related to the target variable (e.g., dividends, earnings, cash flow, etc.); obtaining a data set for the financial instrument, the data set comprising values for the plurality of variables; selecting a bias criteria; selecting a set of model coefficients; (1) generate a set of predicted values for the data set; (2) generate an error set for the data set; (3) generate a set of error threshold values based on the error set and the bias criteria; (4) generate a censored data set based on the error set and the set of error threshold values; (5) generate a set of new model coefficients; and (6) using the set of new model coefficients, repeat steps (1)-(5), unless a censoring performance termination criteria is satisfied.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a flowchart illustrating an embodiment of the data outlier identification and removal method.

FIG. 2 is a flowchart illustrating an embodiment of the data outlier identification and removal method for data quality operations.

FIG. 3 is a flowchart illustrating an embodiment of the data outlier identification and removal method for data validation.

FIG. 4 is an illustrative node for implementing a method of the invention.

FIG. 5 is an illustrative graph for quantitative assessment of a data set.

FIGS. 6A and 6B are illustrative graphs for qualitative assessment of the data set of FIG. 5, illustrating the randomized and realistic data set, respectively, for the entire data set.

FIGS. 7A and 7B are illustrative graphs for qualitative assessment of the data set of FIG. 5, illustrating the randomized and realistic data set, respectively, after removal of 30% of the data as outliers.

FIGS. 8A and 8B are illustrative graphs for qualitative assessment of the data set of FIG. 5, illustrating the randomized and realistic data set, respectively, after removal of 50% of the data as outliers.

FIG. 9 illustrates an exemplary system used to reduce outlier bias in target variables measured for a facility.

DETAILED DESCRIPTION OF THE INVENTION

The following disclosure provides many different embodiments, or examples, for implementing different features of a system and method for accessing and managing structured content. Specific examples of components, processes, and implementations are described to help clarify the invention. These are merely examples and are not intended to limit the invention from that described in the claims. Well-known elements are presented without detailed description so as not to obscure the preferred embodiments of the present invention with unnecessary detail. For the most part, details unnecessary to obtain a complete understanding of the preferred embodiments of the present invention have been omitted inasmuch as such details are within the skills of persons of ordinary skill in the relevant art.

A mathematical description of one embodiment of Dynamic Outlier Bias Reduction is shown as follows:

Nomenclature

-   $\hat{X}$: Set of all data records, $\hat{X} = \hat{X}_k + \hat{X}_{Ck}$, where:
    -   $\hat{X}_k$: Set of accepted data records for the $k^{th}$ iteration
    -   $\hat{X}_{Ck}$: Set of outlier (removed) data records for the $k^{th}$ iteration
-   $\hat{Q}_k$: Set of computed model predicted values for $\hat{X}_k$
-   $\hat{Q}_{Ck}$: Set of outlier model predicted values for data records $\hat{X}_{Ck}$
-   $\hat{A}$: Set of actual values (target values) on which the model is based
-   $\hat{\beta}_{k \to k+1}$: Set of model coefficients at the $(k+1)^{st}$ iteration computed as a result of the model computations using $\hat{X}_k$
-   $M(\hat{X}_k : \hat{\beta}_{k \to k+1})$: Model computation producing $\hat{Q}_{k+1}$ from $\hat{X}_k$, storing model derived and user-supplied coefficients $\hat{\beta}_{k \to k+1}$
-   $C$: User supplied error criteria (%)
-   $\Psi(\hat{Q}_k, \hat{A})$: Error threshold function
-   $F(\Psi, C)$: Error threshold value ($E$)
-   $\hat{\Omega}_k$: Iteration termination criteria, e.g., iteration count, $r^2$, standard error, etc.

Initial Computation, k=0

Initial Step 1: Using initial model coefficient estimates, $\hat{\beta}_{0 \to 1}$, compute initial model predicted values by applying the model to the complete data set:

$\hat{Q}_1 = M(\hat{X} : \hat{\beta}_{0 \to 1})$

Initial Step 2: Compute initial model performance results:

$\hat{\Omega}_1 = f(\hat{Q}_1, \hat{A}, k=0, r^2, \text{standard error}, \text{etc.})$

Initial Step 3: Compute model error threshold value(s):

$E_1 = F(\Psi(\hat{Q}_1, \hat{A}), C)$

Initial Step 4: Filter the data records to remove outliers:

$\hat{X}_1 = \{ \forall x \in \hat{X} \mid \Psi(\hat{Q}_1, \hat{A}) < E_1 \}$

Iterative Computations, k>0

Iteration Step 1: Compute predicted values by applying the model to the accepted data set:

$\hat{Q}_{k+1} = M(\hat{X}_k : \hat{\beta}_{k \to k+1})$

Iteration Step 2: Compute model performance results:

$\hat{\Omega}_{k+1} = f(\hat{Q}_{k+1}, \hat{A}, k, r^2, \text{standard error}, \text{etc.})$

If termination criteria are achieved, stop; otherwise proceed to Step 3:

Iteration Step 3: Compute results for removed data, $\hat{X}_{Ck} = \{ \forall x \in \hat{X} \mid x \notin \hat{X}_k \}$, using current model:

$\hat{Q}_{Ck+1} = M(\hat{X}_{Ck} : \hat{\beta}_{k \to k+1})$

Iteration Step 4: Compute model error threshold values:

$E_{k+1} = F(\Psi(\hat{Q}_{k+1} + \hat{Q}_{Ck+1}, \hat{A}), C)$

Iteration Step 5: Filter the data records to remove outliers:

$\hat{X}_{k+1} = \{ \forall x \in \hat{X} \mid \Psi(\hat{Q}_{k+1} + \hat{Q}_{Ck+1}, \hat{A}) < E_{k+1} \}$

Another mathematical description of one embodiment of Dynamic Outlier Bias Reduction is shown as follows:

Nomenclature

-   $\hat{X}$: Set of all data records, $\hat{X} = \hat{X}_k + \hat{X}_{Ck}$, where:
    -   $\hat{X}_k$: Set of accepted data records for the $k^{th}$ iteration
    -   $\hat{X}_{Ck}$: Set of outlier (removed) data records for the $k^{th}$ iteration
-   $\hat{Q}_k$: Set of computed model predicted values for $\hat{X}_k$
-   $\hat{Q}_{Ck}$: Set of outlier model predicted values for $\hat{X}_{Ck}$
-   $\hat{A}$: Set of actual values (target values) on which the model is based
-   $\hat{\beta}_{k \to k+1}$: Set of model coefficients at the $(k+1)^{st}$ iteration computed as a result of the model computations using $\hat{X}_k$
-   $M(\hat{X}_k : \hat{\beta}_{k \to k+1})$: Model computation producing $\hat{Q}_{k+1}$ from $\hat{X}_k$, storing model derived and user-supplied coefficients $\hat{\beta}_{k \to k+1}$
-   $C_{RE}$: User supplied relative error criterion (%)
-   $C_{AE}$: User supplied absolute error criterion (%)
-   $RE(\hat{Q}_k + \hat{Q}_{Ck}, \hat{A})$: Relative error values for all data records
-   $AE(\hat{Q}_k + \hat{Q}_{Ck}, \hat{A})$: Absolute error values for all data records
-   $P_{RE_k}$: Relative error threshold value for the $k^{th}$ iteration, where

$P_{RE_k} = \text{Percentile}(RE(\hat{Q}_k + \hat{Q}_{Ck}, \hat{A}), C_{RE})$

-   $P_{AE_k}$: Absolute error threshold value for the $k^{th}$ iteration, where

$P_{AE_k} = \text{Percentile}(AE(\hat{Q}_k + \hat{Q}_{Ck}, \hat{A}), C_{AE})$

-   $\hat{\Omega}_k$: Iteration termination criteria, e.g., iteration count, $r^2$, standard error, etc.

Initial Computation, k=0

Initial Step 1: Using initial model coefficient estimates, $\hat{\beta}_{0 \to 1}$, compute initial model predicted value results by applying the model to the complete data set:

$\hat{Q}_1 = M(\hat{X} : \hat{\beta}_{0 \to 1})$

Initial Step 2: Compute initial model performance results:

$\hat{\Omega}_1 = f(\hat{Q}_1, \hat{A}, k=0, r^2, \text{standard error}, \text{etc.})$

Initial Step 3: Compute model error threshold values:

$P_{RE_1} = \text{Percentile}(RE(\hat{Q}_1, \hat{A}), C_{RE})$

$P_{AE_1} = \text{Percentile}(AE(\hat{Q}_1, \hat{A}), C_{AE})$

Initial Step 4: Filter the data records to remove outliers:

$\hat{X}_1 = \left\{ \forall x \in \hat{X} \;\middle|\; \begin{Bmatrix} RE(\hat{Q}_1, \hat{A}) \\ AE(\hat{Q}_1, \hat{A}) \end{Bmatrix} < \begin{pmatrix} P_{RE} \\ P_{AE} \end{pmatrix}_1 \right\}$

Iterative Computations, k>0

Iteration Step 1: Compute model predicted values by applying the model to the outlier removed data set:

$\hat{Q}_{k+1} = M(\hat{X}_k : \hat{\beta}_{k \to k+1})$

Iteration Step 2: Compute model performance results:

$\hat{\Omega}_{k+1} = f(\hat{Q}_{k+1}, \hat{A}, k, r^2, \text{standard error}, \text{etc.})$

If termination criteria are achieved, stop; otherwise proceed to Step 3:

Iteration Step 3: Compute results for the removed data, $\hat{X}_{Ck} = \{ \forall x \in \hat{X} \mid x \notin \hat{X}_k \}$, using current model:

$\hat{Q}_{Ck+1} = M(\hat{X}_{Ck} : \hat{\beta}_{k \to k+1})$

Iteration Step 4: Compute model error threshold values:

$P_{RE_{k+1}} = \text{Percentile}(RE(\hat{Q}_{k+1} + \hat{Q}_{Ck+1}, \hat{A}), C_{RE})$

$P_{AE_{k+1}} = \text{Percentile}(AE(\hat{Q}_{k+1} + \hat{Q}_{Ck+1}, \hat{A}), C_{AE})$

Iteration Step 5: Filter the data records to remove outliers:

$\hat{X}_{k+1} = \left\{ \forall x \in \hat{X} \;\middle|\; \begin{Bmatrix} RE(\hat{Q}_{k+1} + \hat{Q}_{Ck+1}, \hat{A}) \\ AE(\hat{Q}_{k+1} + \hat{Q}_{Ck+1}, \hat{A}) \end{Bmatrix} < \begin{pmatrix} P_{RE} \\ P_{AE} \end{pmatrix}_{k+1} \right\}$

Increment k and proceed to Iteration Step 1.
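The initial and iterative steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patented implementation: the model M is replaced by a one-variable least-squares line (`fit_line`), the percentile uses linear interpolation (`percentile`), the initial coefficients are taken from a fit to the complete data set, and termination uses a change in standard error below `se_tol` or an iteration cap. All function names, parameter names, and the sample data are hypothetical, and the actual values are assumed to be nonzero.

```python
import math

def percentile(values, pct):
    """Percentile (0-100) with linear interpolation; the convention is an assumption."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100.0
    lo = int(rank)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)

def fit_line(xs, ys):
    """Least-squares slope and intercept; a stand-in for the model M."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

def dobr_fit(xs, ys, c_re=80.0, c_ae=80.0, max_iter=25, se_tol=1e-6):
    keep = [True] * len(xs)                  # every record starts as accepted
    slope, intercept = fit_line(xs, ys)      # assumed initial coefficients
    prev_se = None
    for _ in range(max_iter):
        # Apply the current coefficients to the COMPLETE data set.
        pred = [slope * x + intercept for x in xs]
        sq = [(p - a) ** 2 for p, a in zip(pred, ys)]        # absolute error
        rel = [e / (a * a) for e, a in zip(sq, ys)]          # relative error
        # Performance (standard error) on the accepted records only.
        kept_sq = [e for e, k in zip(sq, keep) if k]
        se = math.sqrt(sum(kept_sq) / max(len(kept_sq) - 2, 1))
        if prev_se is not None and abs(prev_se - se) < se_tol:
            break
        prev_se = se
        # Percentile thresholds over all records, then filter.
        p_re = percentile(rel, c_re)
        p_ae = percentile(sq, c_ae)
        new_keep = [r < p_re and e < p_ae for r, e in zip(rel, sq)]
        if sum(new_keep) < 2:
            break                            # too few records to refit a line
        keep = new_keep
        slope, intercept = fit_line(
            [x for x, k in zip(xs, keep) if k],
            [y for y, k in zip(ys, keep) if k],
        )
    return (slope, intercept), keep

# Hypothetical data: y = 2x + 1 with small noise and one gross outlier at index 5.
xs = [float(i) for i in range(1, 13)]
noise = [0.3, -0.2, 0.1, -0.4, 0.2, -0.1, 0.4, -0.3, 0.2, -0.2, 0.1, -0.1]
ys = [2.0 * x + 1.0 + n for x, n in zip(xs, noise)]
ys[5] += 30.0
(slope, intercept), keep = dobr_fit(xs, ys)
print(slope, intercept, keep)
```

Because the errors are recomputed for every record on each pass, a record rejected in one iteration can be readmitted in a later one, which is the behavior the filter equations express by drawing from the complete set rather than from the previously accepted set.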

After each iteration where new model coefficients are computed from the current censored dataset, the removed data from the previous iteration plus the current censored data are recombined. This combination encompasses all data values in the complete dataset. The current model coefficients are then applied to the complete dataset to compute a complete set of predicted values. The absolute and relative errors are computed for the complete set of predicted values and new bias criteria percentile threshold values are computed. A new censored dataset is created by removing all data values where the absolute or relative errors are greater than or equal to the threshold values, and the nonlinear optimization model is then applied to the newly censored dataset to compute new model coefficients. This process enables all data values to be reviewed every iteration for their possible inclusion in the model dataset. It is possible that some data values that were excluded in previous iterations will be included in subsequent iterations as the model coefficients converge on values that best fit the data.

In one embodiment, variations in greenhouse gas (GHG) emissions can result in overestimation or underestimation of emission results, leading to bias in model predicted values. These non-industrial influences, such as environmental conditions and errors in calculation procedures, can cause the results for a particular facility to be radically different from similar facilities, unless the bias in the model predicted values is removed. The bias in the model predicted values may also exist due to unique operating conditions.

The bias can be removed manually by simply removing a facility's data from the calculation if analysts are confident that a facility's calculations are in error or possess unique, extenuating characteristics. Yet, when measuring facility performance across many different companies, regions, and countries, precise a priori knowledge of the data details is not realistic. Therefore any analyst-based data removal procedure has the potential for adding undocumented, non-data supported biases to the model results.

In one embodiment, Dynamic Outlier Bias Reduction is applied to a procedure that uses the data and a prescribed overall error criteria to determine statistical outliers that are removed from the model coefficient calculations. This is a data-driven process that identifies outliers using a global error criteria derived from the data, for example by means of the percentile function. The use of Dynamic Outlier Bias Reduction is not limited to the reduction of bias in model predicted values, and its use in this embodiment is illustrative and exemplary only. Dynamic Outlier Bias Reduction may also be used, for example, to remove outliers from any statistical data set, including use in calculation of, but not limited to, arithmetic averages, linear regressions, and trend lines. The outlier facilities are still ranked from the calculation results, but the outliers are not used in the filtered data set applied to compute model coefficients or statistical results.

A standard procedure, commonly used to remove outliers, is to compute the standard deviation (σ) of the data set and simply define all data outside a 2σ interval of the mean, for example, as outliers. This procedure has statistical assumptions that, in general, cannot be tested in practice. The Dynamic Outlier Bias Reduction method applied in an embodiment of this invention, outlined in FIG. 1, uses both a relative error and an absolute error. For example, for a facility ‘m’:

$\text{Relative Error}_m = \left( \frac{\text{Predicted Value}_m - \text{Actual Value}_m}{\text{Actual Value}_m} \right)^2 \quad (1)$

$\text{Absolute Error}_m = \left( \text{Predicted Value}_m - \text{Actual Value}_m \right)^2 \quad (2)$
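As a small illustration, Eqns. (1) and (2) translate directly into code. The function names and sample values below are hypothetical, and the sketch assumes no actual value is zero, since Eqn. (1) divides by it.

```python
def relative_error(predicted, actual):
    """Eqn. (1): squared relative deviation; assumes actual != 0."""
    return ((predicted - actual) / actual) ** 2

def absolute_error(predicted, actual):
    """Eqn. (2): squared deviation."""
    return (predicted - actual) ** 2

# Hypothetical predicted and actual values for three facilities.
predicted = [110.0, 95.0, 210.0]
actual = [100.0, 100.0, 200.0]

rel = [relative_error(p, a) for p, a in zip(predicted, actual)]
abs_err = [absolute_error(p, a) for p, a in zip(predicted, actual)]
print(rel)
print(abs_err)
```

Note that the first and third facilities have the same absolute error but different relative errors, which is why the two measures can flag different records as outliers.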

In Step 110, the analyst specifies the error threshold criteria that will define outliers to be removed from the calculations. For example, using the percentile operation as the error function, a percentile value of 80 percent for relative and absolute errors could be set. This means that data values with a relative error less than the 80th percentile value of the relative errors and an absolute error less than the 80th percentile value of the absolute errors will be included, and the remaining values are removed or considered as outliers. In this example, for a data value to avoid being removed, its errors must be less than both the relative and absolute error 80th percentile values. However, the percentile thresholds for relative and absolute error may be varied independently, and, in another embodiment, only one of the percentile thresholds may be used.
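The following sketch shows how the 80th percentile example might be applied to a set of precomputed errors. It is illustrative only: the `percentile` helper uses linear interpolation between ranks, which is an assumed convention (the text does not specify one), and the function names and error values are hypothetical.

```python
def percentile(values, pct):
    """Percentile (0-100) with linear interpolation; the convention is an assumption."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100.0
    lo = int(rank)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)

def keep_mask(rel_errors, abs_errors, c_re=80.0, c_ae=80.0):
    """True where a record is below BOTH percentile thresholds."""
    p_re = percentile(rel_errors, c_re)
    p_ae = percentile(abs_errors, c_ae)
    return [r < p_re and a < p_ae for r, a in zip(rel_errors, abs_errors)]

# Hypothetical error values for five records.
rel_errors = [0.01, 0.02, 0.03, 0.04, 0.50]
abs_errors = [1.0, 2.0, 3.0, 40.0, 5.0]

mask = keep_mask(rel_errors, abs_errors)
print(mask)  # [True, True, True, False, False]
```

The fourth record is rejected on absolute error alone and the fifth on relative error alone, which shows why requiring both conditions can remove more than 20 percent of the records at an 80 percent setting.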

In Step 120, the model standard error and coefficient of determination (r²) percent change criteria are specified. While the values of these statistics will vary from model to model, the percent change from the preceding iteration can be preset, for example, at 5 percent. These values can be used to terminate the iteration procedure. Another termination criteria could be a simple iteration count.

In Step 130, the optimization calculation is performed, which produces the model coefficients and predicted values for each facility.

In Step 140, the relative and absolute errors for all facilities are computed using Eqns. (1) and (2).

In Step 150, the error function with the threshold criteria specified in Step 110 is applied to the data computed in Step 140 to determine outlier threshold values.

In Step 160, the data is filtered to include only facilities where the relative error, absolute error, or both errors, depending on the chosen configuration, are less than the error threshold values computed in Step 150.

In Step 170, the optimization calculation is performed using only the outlier removed data set.

In Step 180, the percent change of the standard error and r² are compared with the criteria specified in Step 120. If the percent change is greater than the criteria, the process is repeated by returning to Step 140. Otherwise, the iteration procedure is terminated in Step 190 and the resultant model computed from this Dynamic Outlier Bias Reduction criteria procedure is completed. The model results are applied to all facilities, regardless of whether their data were removed or admitted in the current or past iterations.
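The Step 180 comparison can be sketched as a small predicate. This is an assumed reading: the check is taken to continue iterating when either statistic changes by more than the preset percentage, the previous values are assumed to be nonzero, and the function names and sample statistics are hypothetical.

```python
def percent_change(previous, current):
    """Magnitude of the change relative to the previous value, in percent."""
    return abs(current - previous) / abs(previous) * 100.0

def should_iterate(prev_se, new_se, prev_r2, new_r2, limit_pct=5.0):
    """True if either statistic changed by more than limit_pct (return to Step 140)."""
    return (percent_change(prev_se, new_se) > limit_pct
            or percent_change(prev_r2, new_r2) > limit_pct)

# Hypothetical statistics from successive iterations.
print(should_iterate(10.0, 8.0, 0.80, 0.90))   # True: standard error changed by 20%
print(should_iterate(8.0, 7.9, 0.90, 0.91))    # False: both changes are under 5%
```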

In another embodiment, the process begins with the selection of certain iterative parameters, specifically:

(1) an absolute error and relative error percentile value wherein one, the other or both may be used in the iterative process,

(2) a coefficient of determination (also known as r²) improvement value, and

(3) a standard error improvement value.

The process begins with an original data set, a set of actual data, and either at least one coefficient or a factor used to calculate predicted values based on the original data set. A coefficient or set of coefficients will be applied to the original data set to create a set of predicted values. The set of coefficients may include, but is not limited to, scalars, exponents, parameters, and periodic functions. The set of predicted data is then compared to the set of actual data. A standard error and a coefficient of determination are calculated based on the differences between the predicted and actual data. The absolute and relative error associated with each one of the data points is used to remove data outliers based on the user-selected absolute and relative error percentile values. Ranking the data is not necessary, as all data falling outside the range associated with the percentile values for absolute and/or relative error are removed from the original data set. The use of absolute and relative errors to filter data is illustrative and for exemplary purposes only, as the method may be performed with only absolute or relative error or with another function.

The data associated with the absolute and relative error within a user-selected percentile range is the outlier removed data set, and each iteration of the process will have its own filtered data set. This first outlier removed data set is used to determine predicted values that will be compared with actual values. At least one coefficient is determined by optimizing the errors, and then the coefficient is used to generate predicted values based on the first outlier removed data set. The outlier bias reduced coefficients serve as the mechanism by which knowledge is passed from one iteration to the next.

After the first outlier removed data set is created, the standard error and coefficient of determination are calculated and compared with the standard error and coefficient of determination of the original data set. If the difference in standard error and the difference in coefficient of determination are both below their respective improvement values, then the process stops. However, if at least one of the improvement criteria is not met, then the process continues with another iteration. The use of standard error and coefficient of determination as checks for the iterative process is illustrative and exemplary only, as the check can be performed using only the standard error or only the coefficient of determination, a different statistical check, or some other performance termination criteria (such as number of iterations).

Assuming that the first iteration fails to meet the improvement criteria, the second iteration begins by applying the first outlier bias reduced data coefficients to the original data to determine a new set of predicted values. The original data is then processed again, establishing absolute and relative error for the data points as well as the standard error and coefficient of determination values for the original data set while using the first outlier removed data set coefficients. The data is then filtered to form a second outlier removed data set and to determine coefficients based on the second outlier removed data set.

The second outlier removed data set, however, is not necessarily a subset of the first outlier removed data set, and it is associated with a second set of outlier bias reduced model coefficients, a second standard error, and a second coefficient of determination. Once those values are determined, the second standard error will be compared with the first standard error and the second coefficient of determination will be compared against the first coefficient of determination.

If the improvement value (for standard error and coefficient of determination) exceeds the difference in these parameters, then the process will end. If not, then another iteration will begin by processing the original data yet again, this time using the second outlier bias reduced coefficients to process the original data set and generate a new set of predicted values. Filtering based on the user-selected percentile value for absolute and relative error will create a third outlier removed data set that will be optimized to determine a set of third outlier bias reduced coefficients. The process will continue until the error improvement or other termination criteria are met (such as a convergence criteria or a specified number of iterations).

The output of this process will be a set of coefficients or model parameters, wherein a coefficient or model parameter is a mathematical value (or set of values), such as, but not limited to, a model predicted value for comparing data, slope and intercept values of a linear equation, exponents, or the coefficients of a polynomial. The output of Dynamic Outlier Bias Reduction will not be an output value in its own right, but rather the coefficients that will modify data to determine an output value.

In another embodiment, illustrated in FIG. 2, Dynamic Outlier Bias Reduction is applied as a data quality technique to evaluate the consistency and accuracy of data to verify that the data is appropriate for a specific use. For data quality operations, the method may not involve an iterative procedure. Other data quality techniques may be used alongside Dynamic Outlier Bias Reduction during this process. The method is applied to the arithmetic average calculation of a given data set. The data quality criteria for this example is that successive data values are contained within some range. Thus, any values that are spaced too far apart in value would constitute poor quality data. Error terms are then constructed of successive values of a function, and Dynamic Outlier Bias Reduction is applied to these error values.

In Step 210, the initial data is listed in any order.

Step 220 constitutes the function or operation that is performed on the dataset. In this embodiment example, the function and operation is the ascending ranking of the data followed by successive arithmetic average calculations where each line corresponds to the average of all data at and above the line.

Step 230 computes the relative and absolute errors from the data using successive values from the results of Step 220.

Step 240 allows the analyst to enter the desired outlier removal error criteria (%). The Quality Criteria Value is the resultant value from the error calculations in Step 230 based on the data in Step 220.

Step 250 shows the data quality outlier filtered dataset. Specific values are removed if the relative and absolute errors exceed the specified error criteria given in Step 240.

Step 260 shows the arithmetic average calculation comparison between the complete and outlier removed datasets. As in all applied mathematical or statistical calculations, the analyst is the final step, judging whether the identified outlier removed data elements are actually of poor quality. The Dynamic Outlier Bias Reduction system and method relieves the analyst of directly removing data, but best practice guidelines suggest that the analyst review and check the results for practical relevance.
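One possible reading of Steps 210 through 260 is sketched below. Since FIG. 2 is not reproduced here, several details are assumptions: the successive averages are taken as running means over the ascending-sorted values, the error for each row compares its running mean with the previous row's (with the first row assigned zero error), the percentile uses linear interpolation, and a value is removed only when both errors reach their thresholds. Function names and data are hypothetical, and the running means are assumed to be nonzero.

```python
def percentile(values, pct):
    """Percentile (0-100) with linear interpolation; the convention is an assumption."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100.0
    lo = int(rank)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)

def quality_filter(values, c_re=80.0, c_ae=80.0):
    ordered = sorted(values)                       # Step 220: ascending ranking
    running, total = [], 0.0
    for i, v in enumerate(ordered, start=1):       # Step 220: successive averages
        total += v
        running.append(total / i)
    # Step 230: errors between successive running averages (first row has none).
    pairs = list(zip(running, running[1:]))
    abs_err = [0.0] + [(b - a) ** 2 for a, b in pairs]
    rel_err = [0.0] + [((b - a) / a) ** 2 for a, b in pairs]
    # Step 240: percentile thresholds from the analyst's criteria.
    p_re = percentile(rel_err, c_re)
    p_ae = percentile(abs_err, c_ae)
    # Step 250: drop values whose relative AND absolute errors reach the thresholds.
    return [v for v, r, a in zip(ordered, rel_err, abs_err)
            if not (r >= p_re and a >= p_ae)]

data = [8.0, 5.0, 50.0, 6.0, 7.0]                  # hypothetical input, any order
filtered = quality_filter(data)
# Step 260: compare averages of the complete and outlier removed data sets.
mean_all = sum(data) / len(data)
mean_filtered = sum(filtered) / len(filtered)
print(filtered, mean_all, mean_filtered)
```

With this input the value 50.0 is removed and the average falls from 15.2 to 6.5; whether that value is genuinely poor quality remains the analyst's judgment.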

In another embodiment illustrated in FIG. 3, Dynamic Outlier BiasReduction is applied as a data validation technique that tests thereasonable accuracy of a data set to determine if the data areappropriate for a specific use. For data validation operations, themethod may not involve an iterative procedure. In this example, DynamicOutlier Bias Reduction is applied to the calculation of the PearsonCorrelation Coefficient between two data sets. The Pearson CorrelationCoefficient can be sensitive to values in the data set that arerelatively different than the other data points. Validating the data setwith respect to this statistic is important to ensure that the resultrepresents what the majority of data suggests rather than influence ofextreme values. The data validation process for this example is thatsuccessive data values are contained within a specified range. Thus, anyvalues that are spaced too far apart in value (e.g. outside thespecified range) would signify poor quality data. This is accomplishedby constructing the error terms of successive values of the function.Dynamic Outlier Bias Reduction is applied to these error values, and theoutlier removed data set is validated data.

In Step 310, the paired data is listed in any order.

Step 320 computes the relative and absolute errors for each ordered pair in the dataset.

Step 330 allows the analyst to enter the desired data validation criteria. In the example, 90% thresholds are selected for both the relative and absolute errors. The Quality Criteria Value entries in Step 330 are the resultant absolute and relative error percentile values for the data shown in Step 320.

Step 340 shows the outlier removal process where data that may be invalid is removed from the dataset using the criteria that the relative and absolute error values both exceed the values corresponding to the user selected percentile values entered in Step 330. In practice, other error criteria may be used, and when multiple criteria are applied as shown in this example, any combination of error values may be applied to determine the outlier removal rules.

Step 350 computes the statistical results, in this case the Pearson Correlation Coefficient, for both the validated data and the original data values. These results are then reviewed for practical relevance by the analyst.
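As a minimal sketch of Steps 310 through 350, the following Python fragment filters paired data and compares the Pearson Correlation Coefficient before and after validation. It is illustrative only and is not the implementation of FIG. 3. The function names (percentile, pearson, validate_pairs), the squared form of the error terms, the choice to measure each error between the two members of a pair, and the linear-interpolation percentile are all assumptions made for this sketch; the description above does not specify them. The first member of every pair is assumed to be non-zero, and the sample data are synthetic.

```python
def percentile(values, pct):
    """Percentile by linear interpolation between sorted values (assumed convention)."""
    s = sorted(values)
    pos = (len(s) - 1) * pct / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


def pearson(xs, ys):
    """Pearson Correlation Coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def validate_pairs(xs, ys, pct=90.0):
    """Return (keep flags, r on original data, r on validated data)."""
    # Step 320 (assumed error forms): squared absolute and relative error per pair.
    abs_err = [(y - x) ** 2 for x, y in zip(xs, ys)]
    rel_err = [((y - x) / x) ** 2 for x, y in zip(xs, ys)]
    # Step 330: error values corresponding to the selected percentile.
    abs_cut = percentile(abs_err, pct)
    rel_cut = percentile(rel_err, pct)
    # Step 340: a pair is removed only when BOTH errors exceed their cut-offs.
    keep = [not (a > abs_cut and r > rel_cut) for a, r in zip(abs_err, rel_err)]
    vx = [x for x, k in zip(xs, keep) if k]
    vy = [y for y, k in zip(ys, keep) if k]
    # Step 350: statistic for the original and the validated data.
    return keep, pearson(xs, ys), pearson(vx, vy)


# Synthetic example: twenty closely matched pairs, one deliberately extreme.
xs = [float(i) for i in range(1, 21)]
ys = [x + 0.1 * (-1) ** i for i, x in enumerate(xs)]
ys[9] = 100.0
keep, r_all, r_valid = validate_pairs(xs, ys)
print(f"removed {keep.count(False)} pair(s); r before = {r_all:.3f}, after = {r_valid:.3f}")
```

Requiring both errors to exceed their cut-offs follows Step 340; replacing `and` with `or` in the keep rule gives the "either" rule, one of the other combinations of error values that Step 340 allows.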

In another embodiment, Dynamic Outlier Bias Reduction is used to perform a validation of an entire data set. Standard error improvement value, coefficient of determination improvement value, and absolute and relative error thresholds are selected, and then the data set is filtered according to the error criteria. Even if the original data set is of high quality, there will still be some data that will have error values that fall outside the absolute and relative error thresholds. Therefore, it is important to determine if any removal of data is necessary. If the outlier removed data set passes the standard error improvement and coefficient of determination improvement criteria after the first iteration, then the original data set has been validated, since the filtered data set produced improvements in standard error and coefficient of determination that are too small to be considered significant (e.g. below the selected improvement values).

In another embodiment, Dynamic Outlier Bias Reduction is used to provide insight into how the iterations of data outlier removal are influencing the calculation. Graphs or data tables are provided to allow the user to observe the progression in the data outlier removal calculations as each iteration is performed. This stepwise approach enables analysts to observe unique properties of the calculation that can add value and knowledge to the result. For example, the speed and nature of convergence can indicate the influence of Dynamic Outlier Bias Reduction on computing representative factors for a multi-dimensional data set.

As an illustration, consider a linear regression calculation over a poor quality data set of 87 records. The form of the equation being regressed is y=mx+b. Table 1 shows the results of the iterative process for 5 iterations. Notice that using relative and absolute error criteria of 95%, convergence is achieved in 3 iterations. Changes in the regression coefficients can be observed, and the Dynamic Outlier Bias Reduction method reduced the calculation data set to 79 records. The relatively low coefficient of determination (r²=39%) suggests that lower (<95%) criteria should be tested to study the additional outlier removal effects on the r² statistic and on the computed regression coefficients.

TABLE 1
Dynamic Outlier Bias Reduction Example: Linear Regression at 95%

Iteration   N    Error   r²    m        b
0           87   3.903   25%   −0.428   41.743
1           78   3.048   38%   −0.452   43.386
2           83   3.040   39%   −0.463   44.181
3           79   3.030   39%   −0.455   43.630
4           83   3.040   39%   −0.463   44.181
5           79   3.030   39%   −0.455   43.630

In Table 2, the results of applying Dynamic Outlier Bias Reduction are shown using the relative and absolute error criteria of 80%. Notice that a 15 percentage point (95% to 80%) change in outlier error criteria produced a 35 percentage point (39% to 74%) increase in r² with a 35% additional decrease in admitted data (79 to 51 records included). The analyst can use a graphical view of the changes in the regression lines with the outlier removed data and the numerical results of Tables 1 and 2 in the analysis process to communicate the outlier removed results to a wider audience and to provide more insights regarding the effects of data variability on the analysis results.

TABLE 2
Dynamic Outlier Bias Reduction Example: Linear Regression at 80%

Iteration   N    Error   r²    m        b
0           87   3.903   25%   −0.428   41.743
1           49   1.607   73%   −0.540   51.081
2           64   1.776   68%   −0.561   52.361
3           51   1.588   74%   −0.558   52.514
4           63   1.789   68%   −0.559   52.208
5           51   1.588   74%   −0.558   52.514
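The iterative calculation summarized in Tables 1 and 2 can be sketched in Python as follows. This is a hedged illustration under stated assumptions, not the program that produced the tables. The names (percentile, fit_line, dobr_linear), the squared absolute and relative error forms, the "either error exceeds its percentile" removal rule, the n−2 degrees of freedom in the standard error, and the fixed iteration count are assumptions. The data are synthetic rather than the 87 records of the example, so the printed values will not match the tables. All target values are assumed to be non-zero.

```python
import random


def percentile(values, pct):
    """Percentile by linear interpolation between sorted values (assumed convention)."""
    s = sorted(values)
    pos = (len(s) - 1) * pct / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = m*x + b; returns (m, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    m = sxy / sxx
    return m, my - m * mx


def dobr_linear(xs, ys, bias_pct=80.0, iterations=5):
    """Return one (iteration, N, std_error, r2, m, b) row per iteration."""
    keep = [True] * len(xs)          # iteration 0 uses the complete data set
    rows = []
    for it in range(iterations + 1):
        kx = [x for x, k in zip(xs, keep) if k]
        ky = [y for y, k in zip(ys, keep) if k]
        m, b = fit_line(kx, ky)      # model from the currently admitted records
        sse = sum((y - (m * x + b)) ** 2 for x, y in zip(kx, ky))
        mean_y = sum(ky) / len(ky)
        sst = sum((y - mean_y) ** 2 for y in ky)
        rows.append((it, len(kx), (sse / (len(kx) - 2)) ** 0.5, 1 - sse / sst, m, b))
        # Re-apply the new model to the COMPLETE data set, so that records
        # removed in an earlier iteration can be admitted again.
        abs_err = [(y - (m * x + b)) ** 2 for x, y in zip(xs, ys)]
        rel_err = [a / y ** 2 for a, y in zip(abs_err, ys)]
        abs_cut = percentile(abs_err, bias_pct)
        rel_cut = percentile(rel_err, bias_pct)
        # Assumed rule: a record is an outlier if EITHER error exceeds its cut-off.
        keep = [a <= abs_cut and r <= rel_cut for a, r in zip(abs_err, rel_err)]
    return rows


# Synthetic data only: a noisy straight line with 87 records.
rng = random.Random(0)
xs = [1 + 0.5 * i for i in range(87)]
ys = [40 - 0.4 * x + rng.gauss(0, 3) for x in xs]
print("Iteration   N  Error   r2       m        b")
for it, n, err, r2, m, b in dobr_linear(xs, ys, bias_pct=80.0):
    print(f"{it:9d} {n:3d}  {err:.3f}  {r2:4.0%}  {m:7.3f}  {b:7.3f}")
```

Because the errors are recomputed over the complete data set in every iteration, the number of admitted records N can rise as well as fall from one iteration to the next, which is the kind of behavior visible in the N column of Tables 1 and 2.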

As illustrated in FIG. 4, one embodiment of a system used to perform the method includes a computing system. The hardware consists of a processor 410 that contains adequate system memory 420 to perform the required numerical computations. The processor 410 executes a computer program residing in system memory 420 to perform the method. Video and storage controllers 430 may be used to enable the operation of display 440. The system includes various data storage devices for data input such as floppy disk units 450, internal/external disk drives 460, internal CD/DVDs 470, tape units 480, and other types of electronic storage media 490. The aforementioned data storage devices are illustrative and exemplary only. These storage media are used to enter the data set and outlier removal criteria into the system, store the outlier removed data set, store calculated factors, and store the system-produced trend lines and trend line iteration graphs. The calculations can apply statistical software packages or can be performed from the data entered in spreadsheet formats using Microsoft Excel, for example. The calculations are performed either using customized software programs designed for company-specific system implementations or using commercially available software that is compatible with Excel or other database and spreadsheet programs. The system can also interface with proprietary or public external storage media 300 to link with other databases to provide data to be used with the Dynamic Outlier Bias Reduction system and method calculations. The output devices can be a telecommunication device 510 to transmit the calculation worksheets and other system produced graphs and reports via an intranet or the Internet to management or other personnel, printers 520, electronic storage media similar to those mentioned as input devices 450, 460, 470, 480, 490 and proprietary storage databases 530. These output devices used herein are illustrative and exemplary only.

As illustrated in FIGS. 5, 6A, 6B, 7A, 7B, 8A, and 8B, in one embodiment, Dynamic Outlier Bias Reduction can be used to quantitatively and qualitatively assess the quality of the data set based on the error and correlation of the data set's data values, as compared to the error and correlation of a benchmark dataset comprised of random data values developed from within an appropriate range. In one embodiment, the error can be designated to be the data set's standard error, and the correlation can be designated to be the data set's coefficient of determination (r²). In another embodiment, correlation can be designated to be the Kendall rank correlation coefficient, commonly referred to as Kendall's tau (τ) coefficient. In yet another embodiment, correlation can be designated to be the Spearman's rank correlation coefficient, or Spearman's ρ (rho) coefficient. As explained above, Dynamic Outlier Bias Reduction is used to systematically remove data values that are identified as outliers, not representative of the underlying model or process being described. Normally, outliers are associated with a relatively small number of data values. In practice, however, a dataset could be unknowingly contaminated with spurious values or random noise. The graphical illustrations of FIGS. 5, 6A, 6B, 7A, 7B, 8A, and 8B show how the Dynamic Outlier Bias Reduction system and method can be applied to identify situations where the underlying model is not supported by the data. The outlier reduction is performed by removing data values for which the relative and/or absolute errors, computed between the model predicted and actual data values, are greater than a percentile-based bias criteria, e.g. 80%. This means that the data values are removed if either the relative or absolute error percentile values are greater than the percentile threshold values associated with the 80th percentile (80% of the data values have an error less than this value).

As illustrated in FIG. 5, both a realistic model development dataset and a dataset of random values developed within the range of the actual dataset are compared. Because in practice the analysts typically do not have prior knowledge of any dataset contamination, such realization must come from observing the iterative results from several model calculations using the dynamic outlier bias reduction system and method. FIG. 5 illustrates exemplary model development calculation results for both datasets. The standard error, a measure of the amount of model unexplained error, is plotted versus the coefficient of determination (%) or r², representing how much data variation is explained by the model. The percentile values next to each point represent the bias criteria. For example, 90% signifies that data values with relative or absolute error values greater than the 90th percentile are removed from the model as outliers. This corresponds to removing 10% of the data values with the highest errors each iteration.

As FIG. 5 illustrates, for both the random and realistic dataset models, error is reduced by increasing the bias criteria, i.e., the standard error and the coefficient of determination are improved for both datasets. However, the standard error for the random dataset is two to three times larger than that for the realistic model dataset. The analyst may use a coefficient of determination requirement of 80%, for example, as an acceptable level of precision for determining model parameters. In FIG. 5, an r² of 80% is achieved at a 70% bias criteria for the random dataset, and at an approximately 85% bias criteria for the realistic data. However, the corresponding standard error for the random dataset is over twice as large as that for the realistic dataset. Thus, by systematically running the model dataset analysis with different bias criteria, repeating the calculations with a representative spurious dataset, and plotting the result as shown in FIG. 5, analysts can assess acceptable bias criteria (i.e., the acceptable percentage of data values removed) for a data set, and accordingly, the overall dataset quality. Moreover, such systematic model dataset analysis may be used to automatically render advice regarding the viability of a data set as used in developing a model based on a configurable set of parameters. For example, in one embodiment wherein a model is developed using Dynamic Outlier Bias Removal for a dataset, the error and correlation coefficient values for the model dataset and for a representative spurious dataset, calculated under different bias criteria, may be used to automatically render advice regarding the viability of the data set in supporting the developed model, and inherently, the viability of the developed model in supporting the dataset.

As illustrated in FIG. 5, observing the behavior of these model performance values for several cases provides a quantitative foundation for determining whether the data values are representative of the processes being modeled. For example, referring to FIG. 5, the standard error for the realistic data set at a 100% bias criteria (i.e., no bias reduction), corresponds to the standard error for the random data set at approximately 65% bias criteria (i.e., 35% of the data values with the highest errors removed). Such a finding supports the conclusion that the data is not contaminated.
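The comparison underlying FIG. 5 can be sketched in Python as follows: the same outlier bias reduction is run at several bias criteria on a model data set and on a random data set drawn within the range of the model data set, and the resulting standard error and r² pairs form one curve per data set. This is an illustration under assumptions and does not reproduce FIG. 5. The names (percentile, fit_line, dobr_point, bias_curve), the straight-line model, the squared error forms, the "either" removal rule, the uniform random benchmark, and the synthetic data are all assumptions. Target values are assumed to be non-zero, and the bias criteria are kept at 70% or above so that enough records always remain for a fit.

```python
import random


def percentile(values, pct):
    """Percentile by linear interpolation between sorted values (assumed convention)."""
    s = sorted(values)
    pos = (len(s) - 1) * pct / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = m*x + b; returns (m, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    m = sxy / sxx
    return m, my - m * mx


def dobr_point(xs, ys, bias_pct, iterations=5):
    """Return (std_error, r2) of the line model after outlier bias reduction."""
    keep = [True] * len(xs)
    for _ in range(iterations + 1):
        kx = [x for x, k in zip(xs, keep) if k]
        ky = [y for y, k in zip(ys, keep) if k]
        m, b = fit_line(kx, ky)
        sse = sum((y - (m * x + b)) ** 2 for x, y in zip(kx, ky))
        mean_y = sum(ky) / len(ky)
        sst = sum((y - mean_y) ** 2 for y in ky)
        std_err, r2 = (sse / (len(kx) - 2)) ** 0.5, 1 - sse / sst
        # Errors are recomputed over the complete data set each iteration.
        abs_err = [(y - (m * x + b)) ** 2 for x, y in zip(xs, ys)]
        rel_err = [a / y ** 2 for a, y in zip(abs_err, ys)]
        abs_cut = percentile(abs_err, bias_pct)
        rel_cut = percentile(rel_err, bias_pct)
        # At 100% the cut-offs equal the largest errors, so nothing is removed.
        keep = [a <= abs_cut and r <= rel_cut for a, r in zip(abs_err, rel_err)]
    return std_err, r2


def bias_curve(xs, ys, criteria):
    """One (bias criterion, std_error, r2) point per bias criterion."""
    return [(c,) + dobr_point(xs, ys, c) for c in criteria]


# Synthetic model data set and a random benchmark within its range.
rng = random.Random(1)
xs = [1 + 0.5 * i for i in range(87)]
real = [40 - 0.4 * x + rng.gauss(0, 1.0) for x in xs]
rand = [rng.uniform(min(real), max(real)) for _ in xs]
criteria = [100, 90, 80, 70]
real_curve = bias_curve(xs, real, criteria)
rand_curve = bias_curve(xs, rand, criteria)
for (c, se_r, r2_r), (_, se_n, r2_n) in zip(real_curve, rand_curve):
    print(f"{c:3d}%  model data: SE={se_r:.3f} r2={r2_r:.0%}   random: SE={se_n:.3f} r2={r2_n:.0%}")
```

Plotting the standard error against r² for the two lists of points gives curves of the kind compared in FIG. 5. A model data set whose curve sits close to the random curve would be a candidate for the contamination concern described above.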

In addition to the above-described quantitative analysis facilitated by the illustrative graph of FIG. 5, Dynamic Outlier Bias Reduction can be utilized in an equally powerful, if not more powerful, subjective procedure to help assess a dataset's quality. This is done by plotting the model predicted values against the actual target values given by the data for both the outlier and included results.

FIGS. 6A and 6B illustrate these plots for the 100% points of both the realistic and random curves in FIG. 5. The large scatter in FIG. 6A is consistent with the arbitrary target values and the resultant inability of the model to fit this intentional randomness. FIG. 6B is consistent with, and typical of, practical data collection in that the model prediction and actual values are more closely grouped around the line on which model predicted values equal actual target values (hereinafter Actual=Predicted line).

FIGS. 7A and 7B illustrate the results from the 70% points in FIG. 5 (i.e., 30% of data removed as outliers). In FIGS. 7A and 7B the outlier bias reduction is shown to remove the points most distant from the Actual=Predicted line, but the large variation in model accuracy between FIGS. 7A and 7B suggests that this dataset is representative of the processes being modeled.

FIGS. 8A and 8B show the results from the 50% points in FIG. 5 (i.e., 50% of data removed as outliers). In this case about half of the data is identified as outliers and even with this much variation removed from the dataset, the model, in FIG. 8A, still does not closely describe the random dataset. The general variation around the Actual=Predicted line is about the same as in FIGS. 6A and 7A, taking into account the removed data in each case. FIG. 8B shows that with 50% of the variability removed, the model was able to produce predicted results that closely match the actual data. Analysis of these types of visual plots, in addition to the analysis of performance criteria shown in FIG. 5, can be used by analysts to assess the quality of actual datasets in practice for model development. While FIGS. 5, 6A, 6B, 7A, 7B, 8A, and 8B illustrate visual plots wherein the analysis is based on performance criteria trends corresponding to various bias criteria values, in other embodiments, the analysis can be based on other variables that correspond to bias criteria values, such as model coefficient trends corresponding to various bias criteria selected by the analyst.

Various embodiments include a system for reducing outlier bias in target variables measured for a facility. FIG. 9 illustrates an example of such embodiments. The system illustrated in FIG. 9 comprises a computing unit 1012 by which a data set, such as a data set containing various performance measurements for an industrial facility, can be processed. The computing unit 1012 comprises a processor 1014 and a storage subsystem 1016 on which a computer program embodying the Dynamic Outlier Bias Removal methodology disclosed herein is stored. The system 1010 comprises an input unit 1018 that further comprises a measuring device 1020 for measuring a given target variable and for providing a corresponding data set. The measuring device 1020 can be configured to measure any target variable of interest, such as, for example, the number of parts that leave an industrial plant facility per time unit, or the volume of refined substances produced by a refining facility per time unit. Beyond that, a plurality of target variables can be measured simultaneously. In the embodiment shown the measuring device 1020 comprises a sensor 1022. One of ordinary skill in the art would appreciate that the scope of the present invention includes various sensors that may be used in measuring various physical attributes of material and/or components used in or produced by industrial facilities, such as, for example, sensors capable of detecting and quantifying a chemical compound, e.g. greenhouse gas emissions. In addition, one of ordinary skill in the art will appreciate that measuring a target variable of interest includes any means of collecting, receiving, measuring, accumulating, and processing data. The target variables, data sets, and data can comprise data of all kinds, including but not limited to industrial process data, computer system data, financial data, economic data, stock, bond and futures data, internet search data, security data, voice and other human recognition data, cloud data, big data, insurance data, and other data of interest; the scope and breadth of the disclosure and invention are not limited to the type of target variables, data sets or data. One skilled in the art will also appreciate that the sensor and the measuring device can also be or include computers, computer systems, and processors. Moreover, the system 1010 comprises an output unit 1024 by which the processed data can be outputted. The output device may include a monitor, a printer or a transmission device (not shown).

In one embodiment, the system 1010 initiates the sensor 1022, which in turn detects and quantifies a given compound, e.g. carbon dioxide. The detection and quantification can be done continuously or within discrete time steps. Each time a measurement is completed, a data set is generated, is stored on the storage subsystem 1016, and inputted into the computing unit 1012. The data set is processed by the Dynamic Outlier Bias Removal computer program stored by the storage subsystem 1016 whereby it is censored according to the various embodiments of the methods disclosed herein. Once the computer program has processed the data, the processed data is outputted by the output unit 1024. In an embodiment wherein the output unit 1024 is a monitor or a printer, the results may be visualized in a diagram. In an embodiment wherein the output unit 1024 comprises a transmission device, the processed data is sent to a central database or a control center where the data can be further processed (not shown). Accordingly, the system according to the various disclosed embodiments provides a powerful tool to compare different facilities within one company or within one technical field with each other in an automated way wherein outlier bias is reduced.

In a preferred embodiment the measuring device 1020 comprises one or more sensors for detecting and quantifying a chemical compound. Due to global warming, greenhouse gases emitted by a facility are becoming an increasingly important target variable. Facilities that emit small amounts of greenhouse gases may be better ranked than those emitting higher amounts although the overall productivity of the latter may be better. Examples of greenhouse gases are carbon dioxide (CO2), ozone (O3), water vapor (H2O), hydrofluorocarbons (HFCs), perfluorocarbons (PFCs), chlorofluorocarbons (CFCs), sulphur hexafluoride (SF6), methane (CH4), nitrous oxide (N2O), carbon monoxide (CO), nitrogen oxides (NOx), and non-methane volatile organic compounds (NMVOCs). The automated detection and quantification of these compounds may be used to develop industrial standards regarding certain allowable emissions of the greenhouse gases. However, applying the Dynamic Outlier Bias Removal leads to removing outliers that may be caused by extraordinary circumstances in the production such as operating errors or even accidents. Thus, using various embodiments disclosed herein results in developing more accurate and meaningful standards. Once the industrial standards are developed, the system can be used to compare the emissions with the standards.

One of ordinary skill in the art would further appreciate that the scope of the present invention includes application of the various disclosed embodiments for reducing outlier bias in target variables relating to financial instruments, such as equity securities (e.g., common stock) or derivative contracts (e.g., forwards, futures, options, and swaps, etc.). For example, in one embodiment, the system 1010 comprises an input unit 1018 that receives data relating to a financial instrument, such as a common stock, and provides a corresponding data set. The target variable can be the stock price. Further, variables that relate to the target variable can be determined using various known methods of evaluating financial instruments, such as, for example, discounted cash flow analysis. Such related variables may include the relevant dividends, earnings, or cash flows, earnings per share, price-to-earnings ratio, or growth rate, etc. Once the database of target values and related variable values is formed, various embodiments of the Dynamic Outlier Bias Removal disclosed herein can be applied to the database, resulting in a more accurate model to evaluate the financial instrument.

The foregoing disclosure and description of the preferred embodiments of the invention are illustrative and explanatory thereof and it will be understood by those skilled in the art that various changes in the details of the illustrated system and method may be made without departing from the scope of the invention.

We claim:
 1. A system specialized for assessing the viability of a data set for developing a model for a facility, comprising: an input unit for inputting one or more data sets to be processed, wherein the input unit comprises a measuring device configured to: measure one or more target variables for a facility; and provide a corresponding data set for each of the target variables; a computing unit coupled to the input unit and for processing the one or more data sets, wherein the computing unit comprises a processor and a non-transient storage subsystem; and an output unit coupled to the computing unit and for outputting one or more of the processed data sets received from the computing unit, a computer program stored by the non-transient storage subsystem comprising instructions, when executed by the processor, cause the system specialized for assessing the viability of the corresponding data set for developing a model to perform at least the following: generate a random data set from the corresponding data set; obtain a set of bias criteria values used to determine one or more outliers; perform dynamic outlier bias reduction on the corresponding data set for one or more bias criteria values of the set of bias criteria values to generate one or more outlier bias reduced target data sets; perform dynamic outlier bias reduction on the random data set for the one or more bias criteria values of the set of bias criteria values to generate one or more outlier bias reduced random data sets; calculate a set of target error values for the one or more outlier bias reduced target data sets and a set of random error values for the one or more outlier bias reduced random data sets; calculate a set of target correlation coefficients for the one or more outlier bias reduced target data sets and a set of random correlation coefficients for the outlier bias reduced random data set; construct a first bias criteria curve for the corresponding data set and a second bias criteria curve for the random data set from the one or more bias criteria values, the set of target error values, the set of random error values, the set of target correlation coefficients, and the set of random correlation coefficients; and compare the first bias criteria curve and the second bias criteria curve for determining viability of the corresponding data set used to develop the model.
 2. The system of claim 1, wherein the output unit is configured to display a plot for the first bias criteria curve and the second bias criteria curve.
 3. The system of claim 1, wherein the measuring device comprises a sensor configured to detect a compound corresponding to one of the target variables and quantify the compound corresponding to the one of the target variables.
 4. The system of claim 1, wherein the compound is a greenhouse chemical gas compound, and wherein the sensor is further configured to detect and quantify the compound corresponding to the one of the target variables continuously.
 5. The system of claim 1, wherein the instructions, when executed by the processor, cause the system specialized for assessing the viability of the corresponding data set for developing the model to translate the comparison of the first bias criteria curve and the second bias criteria curve to an automated advice message that indicates the viability of the corresponding data set used to develop the model.
 6. The system of claim 1, wherein the instructions, when executed by the processor, cause the system specialized for assessing the viability of the corresponding data set for developing the model to perform dynamic outlier bias reduction on the corresponding data set for the one or more bias criteria values of the set of bias criteria values to generate the one or more outlier bias reduced target data sets by performing at least the following: for each of the one or more bias criteria values: generate a plurality of model predicted values for the corresponding data set by applying the model to the corresponding data set; compute a plurality of error values determined from the corresponding data set and the model predicted values; compare the error values with the corresponding bias criteria value; remove outliers within the corresponding data set to form the corresponding outlier bias reduced target data set determined from the comparison of the error values with the corresponding bias criteria value; and optimize the model to form an updated model determined from the corresponding outlier bias reduced target data set.
 7. The system of claim 6, wherein the instructions, when executed by the processor, cause the system specialized for assessing the viability of the corresponding data set for developing the model to perform dynamic outlier bias reduction on the corresponding data set for the one or more bias criteria values of the set of bias criteria values to generate the one or more outlier bias reduced target data sets by performing at least the following: for each of the one or more bias criteria values: compare the error values with a predefined termination criteria to determine termination of optimizing the model; and generate a plurality of second model predicted values for the corresponding data set by applying the updated model to the corresponding data set when the comparison of the error values and the predefined termination criteria do not represent termination of optimizing the model.
 8. The system of claim 1, wherein the instructions, when executed by the processor, cause the system specialized for assessing the viability of the corresponding data set for developing the model to compare the first bias criteria curve and the second bias criteria curve for determining viability of the corresponding data set used to develop the model by performing at least the following: determine a first bias criteria value on the first bias criteria curve that corresponds to a first target error value of the set of target error values; determine a second bias criteria value on the second bias criteria curve that corresponds to a first random error value of the set of random error values; and compare the first bias criteria value with the second bias criteria value, wherein the first target error value and the first random error value are the same.
 9. The system of claim 1, wherein the instructions, when executed by the processor, cause the system specialized for assessing the viability of the corresponding data set for developing the model to determine the influence of the dynamic outlier bias reduction for each bias criteria value by performing at least the following: comparing a number of iterations to optimize the model for each of the bias criteria values and comparing the differences in the set of target correlation coefficients.
 10. The system of claim 1, wherein the random data set comprises all random data values based on the corresponding data set, and wherein the instructions, when executed by the processor, cause the system specialized for assessing the viability of the corresponding data set for developing the model to perform dynamic outlier bias reduction on the random data set for the one or more bias criteria values of the set of bias criteria values to generate the one or more outlier bias reduced random data sets by performing at least the following: for each of the bias criteria values: generate a plurality of model predicted values for the random data set by applying the model to the random data set; compute a plurality of error values using the random data set and the model predicted values; compare the error values with the corresponding bias criteria value; remove outliers within the random data set to form the corresponding outlier bias reduced random data set determined from the comparison of the error values with the corresponding bias criteria value; and optimize, by the specially programmed computing system, the model to form an updated model based on the corresponding outlier bias reduced random data set.
 11. The system of claim 1, wherein at least one of the set of target error values is a standard error, and wherein at least one of the set of target correlation coefficients is a coefficient of determination value.
 12. The system of claim 1, wherein the random data set comprises a plurality of random data values generated within a range of a plurality of predicted values of the model.
 13. A system specialized for assessing the viability of a target data set for developing a model for a financial instrument, comprising: an input unit configured to receive a target data set corresponding to a financial instrument, wherein the target data set comprises a plurality of data values for at least one target variable corresponding to the financial instrument; a computing unit coupled to the input unit, wherein the computing unit comprises a processor and a non-transient storage subsystem, a computer program stored by the non-transient storage subsystem comprising instructions, when executed by the processor, cause the system specialized for assessing the viability of the target data set for developing a model to perform at least the following: generate a random data set based on the target data set; receive a plurality of bias criteria values used to determine one or more outliers; produce a plurality of outlier bias reduced target data sets that are associated with the bias criteria values by applying a mathematical model and a dynamic outlier bias reduction to the target data set; produce a plurality of outlier bias reduced random data sets that are associated with the bias criteria values by applying the mathematical model and the dynamic outlier bias reduction to the random data set; calculate at least one target error value for each of the outlier bias reduced target data sets and at least one random error value for each of the outlier bias reduced random data sets; calculate at least one target correlation value for each of the outlier bias reduced target data sets and at least one random correlation value for each of the outlier bias reduced random data sets; construct a first bias criteria curve for the target data set on a graph based on the at least one target error value and the at least one target correlation value for each of the outlier bias reduced target data sets; construct a second bias criteria curve for the random data set on the graph based on the at least one random error value and the at least one random correlation value for each of the outlier bias reduced random data sets; and compare the first bias criteria curve and the second bias criteria curve to determine viability of the target data set used for the mathematical model.
 14. The system of claim 13, wherein the financial instrument is a common stock, and wherein the target variable is the price of the common stock, and wherein the target variable for the financial instrument represents at least one of: dividends, earnings, cash flow, earnings per share, price-to-earnings ratio, and growth rate.
 15. The system of claim 13, wherein the output unit is configured to display a plot for the first bias criteria curve and the second bias criteria curve.
 16. The system of claim 13, wherein the instructions, when executed by the processor, cause the system specialized for assessing the viability of the target data set for developing the model to produce a plurality of outlier bias reduced target data sets that are associated with the bias criteria values by applying a mathematical model and a dynamic outlier bias reduction to the target data set by performing at least the following: for each of the one or more bias criteria values: generate a plurality of model predicted values for the target data set by applying the mathematical model to the target data set; compute a plurality of error values determined from the target data set and the model predicted values; compare the error values with the corresponding bias criteria value; remove outliers within the target data set to form the corresponding outlier bias reduced target data set determined from the comparison of the error values with the corresponding bias criteria value; and optimize the mathematical model to form an updated mathematical model determined from the corresponding outlier bias reduced target data set.
 17. The system of claim 13, wherein the instructions, when executed by the processor, cause the system specialized for assessing the viability of the target data set for developing the model to compare the first bias criteria curve and the second bias criteria curve for determining viability of the target data set used to develop the model by performing at least the following: determine a first bias criteria value on the first bias criteria curve that corresponds to the at least one target error value; determine a second bias criteria value on the second bias criteria curve that corresponds to the at least one random error value; and compare the first bias criteria value with the second bias criteria value, wherein the at least one target error value and the at least one random error value are the same.
 18. A system for reducing outlier bias in target variables measured for a facility, comprising: an input unit for inputting one or more data sets to be processed, wherein the input unit comprises a measuring device configured to: measure one or more target variables for the facility; and provide a corresponding data set for each of the target variables; a computing unit coupled to the input unit and for processing the one or more data sets, wherein the computing unit comprises a processor and a non-transient storage subsystem; an output unit coupled to the computing unit and for outputting one or more of the processed data sets received from the computing unit; and a computer program stored by the non-transient storage subsystem comprising instructions, when executed by the processor, cause the system specialized for reducing outlier bias in target variables measured for the facility to perform at least the following: receive at least one error threshold criteria and the corresponding data set via the database; perform a first iteration of outlier bias reduction for the corresponding data set that comprises: determining a set of predicted values by applying a model comprising at least one coefficient to the data set; comparing the set of predicted values to the data set to produce at least one set of error values; removing a plurality of data outliers from the data set determined from the at least one set of error values and the at least one error threshold criteria to generate an outlier filtered data set; and constructing an updated model comprising at least one updated coefficient from the outlier filtered data set; and perform a second iteration of outlier bias reduction for the data set based upon a determination that at least one termination criteria is not satisfied, wherein performing the second iteration of outlier bias reduction comprises determining a set of second predicted values by applying the updated model to the data set.
 19. The system of claim 18, wherein the instructions, when executed by the processor, cause the system specialized for reducing outlier bias to perform the second iteration of outlier bias reduction for the corresponding data set that further comprises recombining the outlier filtered data set with the data outliers to produce the data set.
 20. The system of claim 18, wherein the instructions, when executed by the processor, cause the system specialized for reducing outlier bias to perform the second iteration of outlier bias reduction for the corresponding data set that further comprises: comparing the set of second predicted values to the corresponding data set to produce at least one set of second error values; removing a plurality of second data outliers from the corresponding data set determined from the at least one set of second error values and the at least one error threshold criteria to generate a second outlier filtered data set; and constructing a second iteration updated model comprising at least one second updated coefficient from the second outlier filtered data set.
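
By way of non-limiting illustration only, the iteration recited in claims 18 through 20 may be sketched as follows. The sketch is not part of the claims and is not asserted to be the claimed implementation. It assumes a straight-line model fitted by ordinary least squares, an absolute error measure, and a termination criterion based on coefficient change; the names fit_line and reduce_outlier_bias and the parameters max_iter and tol are hypothetical and chosen only for this example.

```python
from typing import List, Tuple


def fit_line(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit of y = slope * x + intercept (assumed model)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def reduce_outlier_bias(
    xs: List[float],
    ys: List[float],
    threshold: float,
    max_iter: int = 20,   # hypothetical iteration cap
    tol: float = 1e-9,    # hypothetical termination criterion
) -> Tuple[float, float, List[int]]:
    """Return (slope, intercept, indices of records kept)."""
    slope, intercept = fit_line(xs, ys)
    kept = list(range(len(xs)))
    for _ in range(max_iter):
        # Apply the current model to the COMPLETE data set, so records removed
        # in an earlier iteration are re-applied and may be kept this time.
        errors = [abs(y - (slope * x + intercept)) for x, y in zip(xs, ys)]
        # Remove records whose error is greater than or equal to the criterion.
        kept = [i for i, e in enumerate(errors) if e < threshold]
        # A line cannot be fitted to fewer than two distinct x values.
        if len({xs[i] for i in kept}) < 2:
            break
        new_slope, new_intercept = fit_line(
            [xs[i] for i in kept], [ys[i] for i in kept]
        )
        converged = (
            abs(new_slope - slope) <= tol
            and abs(new_intercept - intercept) <= tol
        )
        slope, intercept = new_slope, new_intercept
        if converged:
            break
    return slope, intercept, kept


# Illustrative data: y = 2x + 1 with one corrupted record at x = 5.
xs = [float(x) for x in range(10)]
ys = [2.0 * x + 1.0 for x in xs]
ys[5] = 100.0
slope, intercept, kept = reduce_outlier_bias(xs, ys, threshold=20.0)
```

In this sketch the error threshold is compared against errors computed for every record on every pass, which corresponds to recombining the outlier filtered data set with the data outliers before the second iteration. A relative error measure or a different model form could be substituted without changing the loop structure.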